FIRST: Many thanks to Britta Schumacher for originally compiling these materials, and to An Bui & Sam Csik’s tidyverse tutorial, Dr. Simona Picardi’s tidyverse chapter, & Dr. Alison Horst’s tidyverse aRt which helped immensely in preparing this document.

What is the tidyverse?

Artwork by the incredible Alison Horst

The tidyverse is a collection of packages for data cleaning, organizing, manipulating, and more! It’s a toolkit designed for data science that streamlines and simplifies code to increase collaboration and reproducibility. Because all of the packages contained within the tidyverse share an underlying philosophy, grammar, and data structure, it’s easy to combine different data wrangling and visualization tasks with confidence across multiple unrelated datasets. Read more about this incredible toolkit here!

The core tidyverse includes the following packages:

  • dplyr for data manipulation;
    • primary functions: arrange(), filter(), group_by(), mutate(), select(), summarize()
  • tidyr for transforming data to a tidy format;
    • primary functions: pivot_longer(), pivot_wider()
  • ggplot2 for plotting/graphics;
    • there is A LOT to cover here; this is a good place to start (Wickham & Grolemund 2017)
    • Additionally, there will be a whole separate EC workshop on using this package
  • stringr for manipulating character strings;
    • primary functions: str_detect(), str_count(), str_subset(), str_extract(), str_replace(), str_match(), str_split()
  • readr for reading in rectangular data (e.g., .csv);
    • primary functions: read_csv(), read_rds()
  • tibble for re-engineered alternatives to data frames;
    • primary functions: as_tibble(), tibble()
  • purrr for functional programming;
    • primary functions: map()
  • forcats for working with categorical variables;
    • primary functions: fct_reorder(), fct_infreq(), fct_relevel(), fct_lump()

Other packages commonly used/encountered in ecology include magrittr for pipes, lubridate for working with dates/times, and readxl for working with Excel datasheets.

These are all distinct packages that can be installed and loaded individually, but again, they share a common grammar, syntax, and data structures, which makes it easier to program across them and integrate them into common workflows.

Load the tidyverse

You can always load any/all tidyverse packages individually, though there is a shortcut that will load the core packages for you.

#load everything
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.7
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#or just individual packages
# library(dplyr)
# library(tidyr) #etc

Pipes

Most people who use the tidyverse packages (and some who do not) prefer to write code using “pipes” (i.e. the %>% operator). A pipe takes the output of one function and uses it as the input for the next. This is useful because we can write code that acts as a “pipeline” instead of a bunch of nested functions. Example:

some_new_thing <- one_last_thing(then_do_that(do_this_first(old_thing)))

becomes:

some_new_thing <- old_thing %>% do_this_first() %>% then_do_that() %>% one_last_thing()

Pipes make code much more human-readable and reduce unnecessary extra code. We will use them, though they are not required for the tidyverse functions to work properly.
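Here is a minimal sketch of the same idea that you can actually run. It uses only base R functions, and the vector x is made up for illustration. The nested and piped versions give the same answer:

```r
library(magrittr) # provides %>% (dplyr and tidyr re-export it too)

x <- c(1, 4, 9) # hypothetical values, just for illustration

# nested: read from the inside out
nested_result <- round(sum(sqrt(x)), 1)

# piped: read from left to right, top to bottom
piped_result <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)

identical(nested_result, piped_result) # both routes give the same value
```

Each step in the piped version sits on its own line, which makes it easy to comment out or reorder a single step while you are debugging.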

Code Examples

Below are reproducible examples of commonly used tidyverse functions.

Let’s first create some completely hypothetical data about the number of Beaver ski days a group got a few seasons ago!

# NOTE: this data is NOT tidy (more on this in a second)
beaver_data <- tribble(
  ~name,    ~`2017`,   ~`2018`,   ~`2019`, ~`2020`,       
  "Britta",       25,        20,        16,        27,    
  "Mark",         20,        15,        11,        12,   
  "Sarah",        18,        17,        10,         8,
  "Dakoeta",      19,        10,        14,        22,
  "Ellie",        34,        15,         9,        17,
  "Erika",        21,        13,        14,        11,
  "Alex",         23,        29,        16,         7
  )
# R doesn't love vars named as numbers, so wrap them in backquotes!
# or avoid the problem by beginning var names with characters
# (e.g. "year_2017")

It’s a REALLY good habit to ALWAYS explore your data before you do anything; form this habit now (you’ll thank yourself down the line!):

str(beaver_data) # view data structures of beaver_data; i.e. column data types
colnames(beaver_data) # view column names of beaver_data
head(beaver_data) # view first 6 rows of beaver_data 

Tidy data and tidyr

Before we do anything else, we’ll first want to reorganize beaver_data so that it is tidy. Tidy data follows three general rules:

  1. variables are in columns (Think: year, age, different kinds of measurements, etc.)
  2. observations are in rows (Think: individual people/animals/plants, counties, different years at a field site, etc.)
  3. values are in cells (Think: population counts, weights/heights, USD$).

Having tidy data allows us to use the same tools in similar ways for different datasets. It also often makes the structure of the data easier to understand, so we spend less mental bandwidth on remembering where everything is and more on the task at hand. We can’t do either of these things with messy data.

Artwork by the incredible Alison Horst

pivot_longer() transforms data from untidy, wide, to long format (NOTE: this function updates gather(), which is no longer under active development)

# let's convert beaver_data to long format using pivot_longer()
tidy_beaver <- beaver_data %>% 
  pivot_longer(cols = c(`2017`, `2018`, `2019`, `2020`), 
               names_to = "year", 
               values_to = "ski_days")
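One thing to watch for: the old column names become text, so year ends up as a character column. Below is a small sketch of one way to handle that, assuming tidyr 1.1.0 or later (where the names_transform argument is available). mini_beaver is a made-up two-person stand-in for beaver_data.

```r
library(tidyr)
library(dplyr)  # for %>% and tribble()

# a two-person stand-in for beaver_data (same shape, fewer rows)
mini_beaver <- tribble(
  ~name,    ~`2017`, ~`2018`,
  "Britta",      25,      20,
  "Mark",        20,      15
)

mini_tidy <- mini_beaver %>%
  pivot_longer(cols = -name,             # every column except name
               names_to = "year",
               values_to = "ski_days",
               names_transform = list(year = as.integer)) # "2017" becomes 2017L

str(mini_tidy) # year is now an integer column
```

Converting year up front saves a separate mutate() step later if you want to sort, filter, or plot by year as a number.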

Conversely, you can transform ‘tidy_beaver’ back to wide format.

pivot_wider() transforms data from long to wide format (NOTE: this function updates spread(), which is no longer under active development)

# let's convert tidy_beaver back to wide format using pivot_wider()
back_to_wide <- tidy_beaver %>% 
  pivot_wider(names_from = year, 
              values_from = ski_days)

Most R functions prefer long format, tidy data, hence the “tidy” in "tidy"verse, and long format data typically eases data processing, but there are cases where wide format data is preferred (e.g., visualizing data in tables for human comprehension).

From here on, we’ll be working with our “tidy” data i.e. tidy_beaver to practice some useful wrangling functions. But first, an interlude about wrangling.

Data Wrangling and dplyr

Data wrangling refers to the art of getting your data into R in a useful form for visualization and modeling. It is definitely an art…and a science…and sometimes takes some brute force…think dancing with a big ole’ bison who’d really like to pierce you, on rollerskates, while trying to cross stitch. It can be a LOT of work; a LOT of HARD work. But tidyverse and our use of tidy datasets have made this work MUCH more predictable and user-friendly.

Artwork by the incredible Alison Horst

Subsetting data:

select() selects columns to retain and specifies their order in the dataframe.

names_beaver <- tidy_beaver %>% 
  select(name, ski_days)

#NOTE: relocate() is a better option if the goal is just to reorder the columns

filter() keeps only the observations (rows) that meet some criteria

britta_mark <- tidy_beaver %>% 
  filter(name == "Britta" | name == "Mark")
# "|" tells R to keep any observations that match "Britta" OR "Mark"

britta_mark_alt <- tidy_beaver %>% 
  filter(name %in% c("Britta", "Mark"))
# alternative: this is nice for filtering through many unique column attributes

not_britta <- tidy_beaver %>% 
  filter(name != "Britta")
# != tells R to keep any observations that DO NOT match "Britta"
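You can also filter on more than one column at a time. A minimal sketch, using a made-up four-row table (mini_tidy) with the same columns as tidy_beaver:

```r
library(dplyr)

# small made-up table with the same columns as tidy_beaver
mini_tidy <- tribble(
  ~name,    ~year,  ~ski_days,
  "Britta", "2017",        25,
  "Britta", "2018",        20,
  "Mark",   "2017",        20,
  "Mark",   "2018",        15
)

# conditions separated by commas are combined with AND
britta_big_years <- mini_tidy %>%
  filter(name == "Britta", ski_days > 20)

# the same filter written with an explicit "&"
britta_big_years_alt <- mini_tidy %>%
  filter(name == "Britta" & ski_days > 20)
```

Both versions keep only the rows where every condition is TRUE.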

pull() pulls out a single variable from a data frame and saves it as a vector

ski_days_vec <- tidy_beaver %>% 
  pull(ski_days)

Manipulating/adding variables:

arrange() orders observations as specified (default = alphabetical or ascending)

ordered_names <- tidy_beaver %>% 
  arrange(name) # for descending alphabetical order, use "arrange(desc(name))"

ordered_num_skidays <- tidy_beaver %>% 
  arrange(ski_days) # or for descending order, use "arrange(-ski_days)"

mutate() is a SUPER versatile function. It is used to create new columns using existing ones. For example, converting measurements into different units, coercing a variable to a different type, or extracting information based on the values in other columns. Below are a few examples of its usefulness!

# use mutate() to convert ski_days to hours on the slopes and days/month
skidays_per_month <- tidy_beaver %>% # assuming there are 4 months in the season
  mutate(ski_hours = ski_days*6, # and 6 hrs of skiing per day (no tailgating??)
         skidays_per_month = ski_days/4)

# use mutate with case_when to add a new column based on multiple conditions
fav_skidays <- tidy_beaver %>% 
  mutate(
    fav_skidays = case_when(
      name == "Britta" ~ "Fresh Pow",
      name == "Mark" ~ "Groomer",
      name == "Sarah" ~ "Blue Bird",
      name == "Ellie" ~ "Ungroomed",
      name == "Dakoeta" ~ "Fresh Pow",
      name == "Erika" ~ "Fresh Pow",
      name == "Alex" ~ "Blue Bird"
    )
  )
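A heads-up about case_when(): any row that matches none of the conditions gets NA. If that is not what you want, one common pattern is a final catch-all condition. A minimal sketch with a made-up three-row table (the "Unknown" label is just a placeholder):

```r
library(dplyr)

mini_tidy <- tribble(
  ~name,    ~ski_days,
  "Britta",        25,
  "Mark",          20,
  "Sarah",         18
)

fav_with_fallback <- mini_tidy %>%
  mutate(
    fav_skidays = case_when(
      name == "Britta" ~ "Fresh Pow",
      name == "Mark"   ~ "Groomer",
      TRUE             ~ "Unknown" # catch-all for every row not matched above
    )
  )
```

Conditions are checked in order, so the catch-all needs to come last.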

# use mutate with if_else for binary conditions... 
# Example: if the observation in the 'name' column matches 
# "Britta", "Erika", or "Sarah", report "yes", they are snowboarders. 
# If not, report "no", they are skiers.
snowboarder <- tidy_beaver %>% 
  mutate(snowboarder = if_else(name %in% c("Britta","Erika","Sarah"), 
                               "yes", 
                               "no")) 

# use mutate() to coerce a variable to a different data type
name_as_factor <- tidy_beaver %>% 
  mutate(name = as_factor(name)) # can check this by viewing str(name_as_factor)

rename() renames columns (shortcut rather than duplicating a variable with a new name).

renamed_beaver <- tidy_beaver %>% 
  rename(total_skidays = ski_days)

Summarizing data:

group_by() groups observations such that data operations are performed at the level of the group (rather than the row). This is SUPER useful if you want to complete analyses by age class or species, for instance.

grouped_names <- tidy_beaver %>% 
  group_by(name) 
# note that the rows and columns look unchanged when you view 'grouped_names.' 
# Grouping data is sort of a phantom operation. The data sits grouped under the 
# hood (the only hint is a "# Groups:" line when the tibble is printed), and 
# nothing visibly happens until you perform some function on the grouped 
# data.... See the summarize() function below.
# You can, however, check whether a dataframe is grouped by using class()
class(grouped_names)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

summarize() is the usual follow-up to group_by(). This is also SUPER useful. Wanna find the mean? Median? Mode? Minimum? Maximum? Standard deviation? summarize() has your back! (Note: you can also do grouped mutate() if you need to).

beaver_summary <- tidy_beaver %>% 
  group_by(name) %>% 
  summarize(
    avg_skidays = mean(ski_days), # substitute any summary stat function here!!
    max_skidays = max(ski_days),
    min_skidays = min(ski_days) # and add as many as you want to calculate!
  )
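The grouped mutate() mentioned above keeps every row and adds the group-level value alongside it, rather than collapsing each group to one row. A minimal sketch with a made-up four-row table:

```r
library(dplyr)

mini_tidy <- tribble(
  ~name,    ~year,  ~ski_days,
  "Britta", "2017",        25,
  "Britta", "2018",        20,
  "Mark",   "2017",        20,
  "Mark",   "2018",        15
)

skidays_vs_avg <- mini_tidy %>%
  group_by(name) %>%
  mutate(avg_skidays   = mean(ski_days),            # each person's own mean
         diff_from_avg = ski_days - avg_skidays) %>% # how each year compares
  ungroup() # drop the grouping so later steps are not silently grouped
```

The result still has four rows, one per person-year, with each person's average repeated on their rows.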

count() & tally() are ways to count observations and are shortcuts for various forms of summarize(). tally() counts observations, and count() performs a grouped tally().

#three ways to get the number of entries under each name
summarized_beaver <- tidy_beaver %>%
  group_by(name) %>%
  summarize(n = n())

tallied_beaver <- tidy_beaver %>% 
  group_by(name) %>% 
  tally()

counted_beaver <- tidy_beaver %>%
  count(name)

# tally can also sum the values of variables within groups (syntax: wt = var_name)
# use this to get the total number of ski days for each person over the 4 years
tallied_beaver <- tidy_beaver %>% 
  group_by(name) %>% 
  tally(ski_days)

#tally() on an ungrouped dataframe is equivalent to nrow()
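count() accepts the same kind of weighting, plus a couple of conveniences. A minimal sketch with a made-up four-row table; the column name total_skidays is just an illustrative choice:

```r
library(dplyr)

mini_tidy <- tribble(
  ~name,    ~year,  ~ski_days,
  "Britta", "2017",        25,
  "Britta", "2018",        20,
  "Mark",   "2017",        20,
  "Mark",   "2018",        15
)

# wt sums ski_days within each name, sort puts the largest total first,
# and name sets the output column name (the default would be "n")
total_by_person <- mini_tidy %>%
  count(name, wt = ski_days, sort = TRUE, name = "total_skidays")

# on an ungrouped data frame, tally() returns a one-row tibble holding nrow()
n_rows <- mini_tidy %>% tally()
```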

Now let’s practice on some REALish data!

In honor of Black Dragon Canyon Wash, let’s pretend we’re trying to understand how different species of chromatic dragons (a decently temperamental critter) with various life lengths compare in terms of number of scars and BMI.

Load packages:

#library(tidyverse)
#install.packages("DALEX")
library(DALEX) #where our data comes from
## Welcome to DALEX (version: 2.3.0).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## 
## Attaching package: 'DALEX'
## The following object is masked from 'package:dplyr':
## 
##     explain
data(dragons)
view(dragons)

Explore:

We should first familiarize ourselves with the data.

dim(dragons)      # view dimensions of the df
head(dragons)     # view first 6 rows of df
tail(dragons)     # view last 6 rows of df
str(dragons)      # view data structure of df
colnames(dragons) # view the columns of df

Wrangle:

This dataset is pretty big – we’ll want to do some wrangling so that it only includes the information that we’re interested in. We will:

  1. filter for black & blue dragons
  2. select relevant columns of data
  3. rename columns
  4. create new columns
  5. combine genus and species into a single column
  6. create categorical variable to group dragons by age!

To demonstrate these individual steps, we’ll perform each function separately. Notice that we perform subsequent function calls on the data frame generated from the prior step. At the end, we’ll show you how to combine all steps into a single, succinct code chunk. Creating efficient workflows by combining multiple data wrangling steps is one of the great POWERS of tidyverse!

a. filter for black & blue dragons

This dataset has information on dragons of four different colors. We are interested only in black & blue dragons. First, we’ll filter colour for only “black” and “blue” values.

black_and_blue <- dragons %>% 
  filter(colour %in% c("black","blue"))

b. select the columns we want

Let’s select only the columns we’re interested in.

select_columns <- black_and_blue %>% 
  select(2:5,7, life_length) # you can supply a range of columns, or specify them individually

c. rename columns

To make this even more manageable, we can change column names to something easier (i.e. shorter to type). For example:

rename_columns <- select_columns %>% 
  rename(teeth = number_of_lost_teeth,
         age = life_length)

d. create new columns

We can also create new columns:

  1. based conditionally on other columns; OR,
  2. by performing some calculation.
# conditional column addition
genus_species <- rename_columns %>% 
  mutate(genus = if_else(colour %in% c("blue"), "Sauroniops", "Jaggermeryx"),
         species = case_when(
              colour == "blue" & age < 1200 ~ "reike",
              colour == "blue" & age >= 1200 ~ "naida", # >= so age == 1200 is not left as NA
              colour == "black" & age < 1000 ~ "ozzyi",
              colour == "black" & (age >= 1000 & age <= 1700) ~ "whido",
              colour == "black" & age > 1700 ~ "strummeri"))

# perform some operation/calculation
BMI <- genus_species %>% 
  mutate(BMI = (weight*2000)/0.45359237 / height^2)

e. combine the genus and species into a single column.

unite_columns <- BMI %>% 
  unite(genus_species, genus, species, sep = " ") # sep = "_" is the default

f. create categorical variable to group dragons by age!

cat_age <- unite_columns %>% 
  mutate(age_group = as.factor(case_when(
              age < 1000 ~ "dragonling",
              age >= 1000 & age <= 2000 ~ "juvenile",
              age > 2000 ~ "adult")))
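One thing to be aware of: as.factor() orders the levels alphabetically, so "adult" comes before "dragonling". That order carries through to things like plot axes. If you would rather have the levels follow the life stages, one option is to set them explicitly with factor(). A minimal sketch with made-up ages:

```r
library(dplyr) # for case_when()

ages <- c(500, 1500, 2500) # hypothetical ages, one per age group

age_labels <- case_when(
  ages < 1000 ~ "dragonling",
  ages >= 1000 & ages <= 2000 ~ "juvenile",
  ages > 2000 ~ "adult"
)

# default: levels are sorted alphabetically
alphabetical_levels <- levels(as.factor(age_labels))

# explicit: levels follow the order we supply
age_ordered <- factor(age_labels,
                      levels = c("dragonling", "juvenile", "adult"))
levels(age_ordered)
```

forcats::fct_relevel() is another way to get the same result inside a pipeline.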

Now let’s pull all of these steps together!

We split each wrangling step up into a separate data frame, but you could have linked all these functions together in one chunk using the pipe operator, like this:

dragons_simple <- dragons %>% 
  filter(colour %in% c("black","blue")) %>% 
  select(2:5,7, life_length) %>%  
  rename(teeth = number_of_lost_teeth,
         age = life_length) %>% 
  mutate(genus = if_else(colour %in% c("blue"), "Sauroniops", "Jaggermeryx"),
         species = case_when(
              colour == "blue" & age < 1200 ~ "reike",
              colour == "blue" & age >= 1200 ~ "naida", # >= so age == 1200 is not left as NA
              colour == "black" & age < 1000 ~ "ozzyi",
              colour == "black" & (age >= 1000 & age <= 1700) ~ "whido",
              colour == "black" & age > 1700 ~ "strummeri")) %>% 
  mutate(BMI = (weight*2000)/0.45359237 / height^2) %>% 
  unite(genus_species, genus, species, sep = " ") %>% 
  mutate(age_group = as.factor(case_when(
              age < 1000 ~ "dragonling",
              age >= 1000 & age <= 2000 ~ "juvenile",
              age > 2000 ~ "adult")))

# save data
#saveRDS(dragons_simple, "./out-data/dragons-tidy.RDS")

With this simplified and cleaned data set, we’re ready to explore! Let’s first isolate data we want to visualize by:

  1. grouping observations by age_group & genus_species
  2. finding the average scars, BMI, and number of teeth for each species-age combination
  3. pivot_longer() into tidy format for visualizing
fav_spp <- dragons_simple %>% 
  group_by(age_group,genus_species) %>% 
  summarise(ave_scars = mean(scars), # mean() returns one value per group
            ave_BMI = mean(BMI),
            ave_teeth = mean(teeth)) %>% 
  pivot_longer(cols = c(`ave_scars`, `ave_BMI`, `ave_teeth`), names_to = "summary_var", values_to = "ave")
## `summarise()` has grouped output by 'age_group'. You can override using the `.groups` argument.
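When the same summary function is applied to several columns, across() can shorten the call. This is a minimal sketch, assuming dplyr 1.0.0 or later; mini_dragons is a made-up three-row table, not the real dragons data.

```r
library(dplyr)

mini_dragons <- tribble(
  ~genus_species,     ~age_group,   ~scars, ~teeth,
  "Sauroniops reike", "dragonling",      2,      4,
  "Sauroniops reike", "dragonling",      4,      6,
  "Sauroniops naida", "juvenile",       10,      8
)

mini_summary <- mini_dragons %>%
  group_by(age_group, genus_species) %>%
  summarise(across(c(scars, teeth), mean, .names = "ave_{.col}"), # ave_scars, ave_teeth
            .groups = "drop") # return an ungrouped tibble (and silence the message)
```

The .names pattern controls the output column names, so adding another variable to the summary only means adding it to the c() call.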

Plot:

Now that we have our data summarized and in tidy format, we’re ready to make a plot! We want to:

  1. create a column graph showing the summarized data by species and by age_group
  2. create a different panel for each dragon species
  3. make it pretty

Note: Only the first 3 lines of the following code are necessary to make the plot. Everything else simply modifies the appearance and makes it a bit more presentable. There are tons of ways to customize plots – we explore only a few options below.

ggplot(fav_spp, aes(x = age_group, y = ave, fill = summary_var)) + # fill = which summary variable each column shows
  geom_col(position = "dodge") + # side-by-side columns for each summary variable (instead of stacked)
  facet_wrap(~genus_species) + # create separate panels for each species
  ggtitle("Chromatic dragon traits") +
  labs(x = "Age Group", y = "", fill = "Summary Variable") +
  scale_fill_manual(labels = c("Average BMI", "Average Scars", "Average Teeth Lost"), values = c("darkseagreen3", "cadetblue", "orange")) + # change colors
  theme_bw() + 
  theme(panel.border = element_rect(colour = "black", fill = NA, size = 0.7), 
        axis.text.x = element_text(angle = 45, hjust = 0.9))

#ggsave("./out-plots/dragon_plot.png")

Sandbox

For experimenting with stringr, lubridate, and more!
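To get the sandbox started, here is a small sketch of a few stringr and lubridate calls. The species names come from the dragon example above, and the date is made up.

```r
library(stringr)
library(lubridate)

# stringr: work with character strings
spp <- c("Sauroniops reike", "Jaggermeryx ozzyi")
is_sauroniops <- str_detect(spp, "Sauroniops") # TRUE where the pattern is found
spp_snake     <- str_replace(spp, " ", "_")    # swap the first space for "_"
spp_pieces    <- str_split(spp, " ")           # list of genus/species pieces

# lubridate: work with dates
first_ski_day <- ymd("2019-12-07")      # parse a year-month-day string
ski_year      <- year(first_ski_day)    # pull out the year as a number
a_month_later <- first_ski_day + days(30) # date arithmetic
```

Try swapping in your own strings and dates, then inspect each object to see what changed.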